library(knitr)      # kable()
library(car)        # Boxplot(), scatterplotMatrix(), avPlots(), qqPlot(), vif(), etc.
library(RcmdrMisc)  # Hist()

# Import Data
White_wines <- read.table("~/Desktop/Big Data/Regression-1/White_wines.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
#View(White_wines)

Summary of Data

Look at a summary of the data.

summary(White_wines)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Table 1: Summary Statistics

kable(summary(White_wines), format = "markdown")
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| Min. : 3.800 | Min. :0.0800 | Min. :0.0000 | Min. : 0.600 | Min. :0.00900 | Min. : 2.00 | Min. : 9.0 | Min. :0.9871 | Min. :2.720 | Min. :0.2200 | Min. : 8.00 | Min. :3.000 |
| 1st Qu.: 6.300 | 1st Qu.:0.2100 | 1st Qu.:0.2700 | 1st Qu.: 1.700 | 1st Qu.:0.03600 | 1st Qu.: 23.00 | 1st Qu.:108.0 | 1st Qu.:0.9917 | 1st Qu.:3.090 | 1st Qu.:0.4100 | 1st Qu.: 9.50 | 1st Qu.:5.000 |
| Median : 6.800 | Median :0.2600 | Median :0.3200 | Median : 5.200 | Median :0.04300 | Median : 34.00 | Median :134.0 | Median :0.9937 | Median :3.180 | Median :0.4700 | Median :10.40 | Median :6.000 |
| Mean : 6.855 | Mean :0.2782 | Mean :0.3342 | Mean : 6.391 | Mean :0.04577 | Mean : 35.31 | Mean :138.4 | Mean :0.9940 | Mean :3.188 | Mean :0.4898 | Mean :10.51 | Mean :5.878 |
| 3rd Qu.: 7.300 | 3rd Qu.:0.3200 | 3rd Qu.:0.3900 | 3rd Qu.: 9.900 | 3rd Qu.:0.05000 | 3rd Qu.: 46.00 | 3rd Qu.:167.0 | 3rd Qu.:0.9961 | 3rd Qu.:3.280 | 3rd Qu.:0.5500 | 3rd Qu.:11.40 | 3rd Qu.:6.000 |
| Max. :14.200 | Max. :1.1000 | Max. :1.6600 | Max. :65.800 | Max. :0.34600 | Max. :289.00 | Max. :440.0 | Max. :1.0390 | Max. :3.820 | Max. :1.0800 | Max. :14.20 | Max. :9.000 |

This dataset is composed of 12 variables. The dependent variable of interest is quality. We will investigate the relationship between quality and the remaining 11 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol).

Univariate Analysis

Quality appears to be approximately normally distributed, with scores ranging from a minimum of 3 to a maximum of 9, a mean of 5.88, and a median of 6.0. A boxplot of quality shows potential outliers. These should be considered when interpreting the remainder of the analysis.
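The outlier flags in the boxplot can be anticipated from the quartiles in the summary table. The short sketch below assumes the conventional 1.5 × IQR fence rule that R boxplots apply by default; the variable names are illustrative and not part of the original analysis.

```r
# Quartiles of quality, taken from the summary output above
q1 <- 5
q3 <- 6
iqr <- q3 - q1

# Conventional boxplot fences: 1.5 * IQR beyond each quartile
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Quality takes integer values from 3 to 9 in this dataset
scores <- 3:9
flagged <- scores[scores < lower_fence | scores > upper_fence]
print(flagged)
```

Under this rule the fences are 3.5 and 7.5, so every wine scored 3, 8, or 9 is drawn as an outlier. These are legitimate ratings rather than data errors, which is worth keeping in mind before removing any of them.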

Figure 1: Histogram of Quality

with(White_wines, Hist(quality, scale="frequency", breaks="Sturges", 
  col="darkgray"))

Figure 2: Boxplot of Quality

Boxplot( ~ quality, data=White_wines, id.method="y")

##  [1] "252"  "254"  "295"  "446"  "741"  "874"  "1035" "1230" "1418" "1485"
## [11] "775"  "821"  "828"  "877"  "1606" "18"   "21"   "23"   "69"   "75"

The distribution of the remaining variables can be seen in the histograms below. Residual sugar, alcohol, and volatile acidity appear to have right-skewed distributions. The other variables, while not perfect, appear to be approximately normally distributed.

Figure 3: Histogram of alcohol

with(White_wines, Hist(alcohol, scale="frequency", breaks="Sturges", 
  col="darkgray"))

Figure 4: Histogram of Chlorides

with(White_wines, Hist(chlorides, scale="frequency", breaks="Sturges", 
  col="darkgray"))

Figure 5: Histogram of Citric Acid

with(White_wines, Hist(citric.acid, scale="frequency", breaks="Sturges", 
  col="darkgray"))

Figure 6: Histogram of Density

with(White_wines, Hist(density, scale="frequency", breaks="Sturges", 
  col="darkgray"))

Figure 7: Histogram of Fixed Acidity

with(White_wines, Hist(fixed.acidity, scale="frequency", breaks="Sturges", 
  col="darkgray"))

Figure 8: Histogram of Free Sulfur Dioxide

with(White_wines, Hist(free.sulfur.dioxide, scale="frequency", 
  breaks="Sturges", col="darkgray"))

Figure 9: Histogram of pH

with(White_wines, Hist(pH, scale="frequency", breaks="Sturges", 
  col="darkgray"))

Figure 10: Histogram of Residual Sugar

with(White_wines, Hist(residual.sugar, scale="frequency", breaks="Sturges", 
  col="darkgray"))

Figure 11: Histogram of Sulphates

with(White_wines, Hist(sulphates, scale="frequency", breaks="Sturges", 
  col="darkgray"))

Figure 12: Histogram of Total Sulfur Dioxide

with(White_wines, Hist(total.sulfur.dioxide, scale="frequency", 
  breaks="Sturges", col="darkgray"))

Figure 13: Histogram of Volatile Acidity

with(White_wines, Hist(volatile.acidity, scale="frequency", 
  breaks="Sturges", col="darkgray"))

Multivariate Analysis

To begin investigating potential relationships, scatterplot matrices have been run below.

Figure 13: Scatterplot Matrix: quality, alcohol, chlorides, citric acid.

scatterplotMatrix(~alcohol+chlorides+citric.acid+quality, reg.line=FALSE, 
  smooth=FALSE, spread=FALSE, span=0.5, ellipse=FALSE, levels=c(.5, .9), 
  id.n=0, diagonal = 'density', data=White_wines)

Figure 14: Scatterplot Matrix: quality, density, fixed acidity, free sulfur dioxide.

scatterplotMatrix(~density+fixed.acidity+free.sulfur.dioxide+quality, 
  reg.line=FALSE, smooth=FALSE, spread=FALSE, span=0.5, ellipse=FALSE, 
  levels=c(.5, .9), id.n=0, diagonal = 'density', data=White_wines)

Figure 15: Scatterplot Matrix: quality, pH, residual sugar, sulphates.

scatterplotMatrix(~pH+quality+residual.sugar+sulphates, reg.line=FALSE, 
  smooth=FALSE, spread=FALSE, span=0.5, ellipse=FALSE, levels=c(.5, .9), 
  id.n=0, diagonal = 'density', data=White_wines)

Figure 16: Scatterplot Matrix: quality, total sulfur dioxide, volatile acidity.

scatterplotMatrix(~quality+total.sulfur.dioxide+volatile.acidity, 
  reg.line=FALSE, smooth=FALSE, spread=FALSE, span=0.5, ellipse=FALSE, 
  levels=c(.5, .9), id.n=0, diagonal = 'density', data=White_wines)

Linear Correlation analysis shows:

kable(cor(White_wines[,c("alcohol","chlorides","citric.acid","density",
  "fixed.acidity","quality")], use="complete"))
| | alcohol | chlorides | citric.acid | density | fixed.acidity | quality |
|:--|--:|--:|--:|--:|--:|--:|
| alcohol | 1.0000000 | -0.3601887 | -0.0757287 | -0.7801376 | -0.1208811 | 0.4355747 |
| chlorides | -0.3601887 | 1.0000000 | 0.1143644 | 0.2572113 | 0.0230856 | -0.2099344 |
| citric.acid | -0.0757287 | 0.1143644 | 1.0000000 | 0.1495026 | 0.2891807 | -0.0092091 |
| density | -0.7801376 | 0.2572113 | 0.1495026 | 1.0000000 | 0.2653310 | -0.3071233 |
| fixed.acidity | -0.1208811 | 0.0230856 | 0.2891807 | 0.2653310 | 1.0000000 | -0.1136628 |
| quality | 0.4355747 | -0.2099344 | -0.0092091 | -0.3071233 | -0.1136628 | 1.0000000 |

kable(cor(White_wines[,c("free.sulfur.dioxide","pH","quality","residual.sugar",
  "sulphates")], use="complete"))
| | free.sulfur.dioxide | pH | quality | residual.sugar | sulphates |
|:--|--:|--:|--:|--:|--:|
| free.sulfur.dioxide | 1.0000000 | -0.0006178 | 0.0081581 | 0.2990984 | 0.0592172 |
| pH | -0.0006178 | 1.0000000 | 0.0994272 | -0.1941335 | 0.1559515 |
| quality | 0.0081581 | 0.0994272 | 1.0000000 | -0.0975768 | 0.0536779 |
| residual.sugar | 0.2990984 | -0.1941335 | -0.0975768 | 1.0000000 | -0.0266644 |
| sulphates | 0.0592172 | 0.1559515 | 0.0536779 | -0.0266644 | 1.0000000 |

kable(cor(White_wines[,c("quality","total.sulfur.dioxide","volatile.acidity")], 
  use="complete"))
| | quality | total.sulfur.dioxide | volatile.acidity |
|:--|--:|--:|--:|
| quality | 1.0000000 | -0.1747372 | -0.1947230 |
| total.sulfur.dioxide | -0.1747372 | 1.0000000 | 0.0892605 |
| volatile.acidity | -0.1947230 | 0.0892605 | 1.0000000 |

Table 2: Correlation of Independent Variables With Wine Quality

| Variable | Correlation (r) |
|:--|--:|
| alcohol | 0.435574715 |
| chlorides | -0.209934411 |
| citric.acid | -0.009209091 |
| density | -0.307123313 |
| fixed.acidity | -0.113662831 |
| free.sulfur.dioxide | 0.008158067 |
| pH | 0.099427246 |
| residual.sugar | -0.097576829 |
| sulphates | 0.053677877 |
| total.sulfur.dioxide | -0.1747372 |
| volatile.acidity | -0.1947230 |

There seems to be a moderate positive relationship between alcohol and quality (r = 0.44). Density (r = -0.31), chlorides (r = -0.21), volatile acidity (r = -0.19), and total sulfur dioxide (r = -0.17) have the strongest negative correlations with quality, although all of these are weak.
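The correlations with quality were collected from three separate matrices; they can also be obtained in one call. The sketch below uses a small synthetic data frame (`wines`, an illustrative stand-in for `White_wines`, not the real data) and relies only on base R's `cor()`.

```r
set.seed(1)
n <- 200
# Synthetic stand-in for White_wines; the values are illustrative only
wines <- data.frame(alcohol   = rnorm(n, 10.5, 1.2),
                    density   = rnorm(n, 0.994, 0.003),
                    chlorides = rnorm(n, 0.046, 0.02))
wines$quality <- round(6 + 0.4 * (wines$alcohol - 10.5) + rnorm(n))

predictors <- setdiff(names(wines), "quality")

# cor(x, y) with a data frame x and a vector y returns one r per column of x
r <- cor(wines[, predictors], wines$quality)[, 1]

# Assemble a two-column table ordered by absolute strength
table2 <- data.frame(variable = predictors, r = round(unname(r), 3))
table2 <- table2[order(-abs(table2$r)), ]
print(table2, row.names = FALSE)
```

Sorting by absolute value puts the strongest relationships first regardless of sign, which makes the ranking described in the text easier to read off.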

Linear Regressions

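To further investigate potential relationships between quality and the other variables, simple linear regressions have been run below. Note that each model is specified with quality on the right-hand side (for example, alcohol ~ quality). In a simple linear regression the slope's t statistic, its p-value, and R-squared are the same whichever of the two variables is treated as the response, so the significance results still describe the relationship with quality.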

Regression Model Alcohol and quality.

RegModel.Alcohol <- lm(alcohol~quality, data=White_wines)
summary(RegModel.Alcohol)
## 
## Call:
## lm(formula = alcohol ~ quality, data = White_wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2986 -0.7882 -0.1382  0.8014  4.1223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.95670    0.10626   65.47   <2e-16 ***
## quality      0.60524    0.01788   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.108 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16

Regression Model fixed.acidity and quality.

RegModel.fixed.acidity <- lm(fixed.acidity~quality, data=White_wines)
summary(RegModel.fixed.acidity)
## 
## Call:
## lm(formula = fixed.acidity ~ quality, data = White_wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0416 -0.5499 -0.0499  0.4667  7.3584 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.49138    0.08042  93.152  < 2e-16 ***
## quality     -0.10830    0.01353  -8.005 1.48e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8385 on 4896 degrees of freedom
## Multiple R-squared:  0.01292,    Adjusted R-squared:  0.01272 
## F-statistic: 64.08 on 1 and 4896 DF,  p-value: 1.48e-15

Regression Model volatile.acidity and quality.

RegModel.volatile.acidity <- lm(volatile.acidity~quality, data=White_wines)
summary(RegModel.volatile.acidity)
## 
## Call:
## lm(formula = volatile.acidity ~ quality, data = White_wines)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20986 -0.06554 -0.01554  0.04446  0.78014 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.408504   0.009483   43.08   <2e-16 ***
## quality     -0.022161   0.001595  -13.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09888 on 4896 degrees of freedom
## Multiple R-squared:  0.03792,    Adjusted R-squared:  0.03772 
## F-statistic:   193 on 1 and 4896 DF,  p-value: < 2.2e-16

Regression Model citric.acid and quality.

RegModel.citric.acid <- lm(citric.acid~quality, data=White_wines)
summary(RegModel.citric.acid)
## 
## Call:
## lm(formula = citric.acid ~ quality, data = White_wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3366 -0.0653 -0.0153  0.0547  1.3260 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.341588   0.011608  29.427   <2e-16 ***
## quality     -0.001258   0.001953  -0.644    0.519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.121 on 4896 degrees of freedom
## Multiple R-squared:  8.481e-05,  Adjusted R-squared:  -0.0001194 
## F-statistic: 0.4153 on 1 and 4896 DF,  p-value: 0.5193

Regression Model residual.sugar and quality.

RegModel.residual.sugar <- lm(residual.sugar~quality, data=White_wines)
summary(RegModel.residual.sugar)
## 
## Call:
## lm(formula = residual.sugar ~ quality, data = White_wines)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.300 -4.482 -1.023  3.412 59.477 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.67613    0.48420   19.98  < 2e-16 ***
## quality     -0.55882    0.08146   -6.86 7.72e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.048 on 4896 degrees of freedom
## Multiple R-squared:  0.009521,   Adjusted R-squared:  0.009319 
## F-statistic: 47.06 on 1 and 4896 DF,  p-value: 7.724e-12

Regression Model chlorides and quality.

RegModel.chlorides <- lm(chlorides~quality, data=White_wines)
summary(RegModel.chlorides)
## 
## Call:
## lm(formula = chlorides ~ quality, data = White_wines)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.042498 -0.009319 -0.003140  0.003860  0.295681 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0762135  0.0020490   37.20   <2e-16 ***
## quality     -0.0051789  0.0003447  -15.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02136 on 4896 degrees of freedom
## Multiple R-squared:  0.04407,    Adjusted R-squared:  0.04388 
## F-statistic: 225.7 on 1 and 4896 DF,  p-value: < 2.2e-16

Regression Model free.sulfur.dioxide and quality.

RegModel.free.sulfur.dioxide <- lm(free.sulfur.dioxide~quality, data=White_wines)
summary(RegModel.free.sulfur.dioxide)
## 
## Call:
## lm(formula = free.sulfur.dioxide ~ quality, data = White_wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.171 -12.171  -1.484  10.516 254.143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.3872     1.6313  21.080   <2e-16 ***
## quality       0.1567     0.2744   0.571    0.568    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.01 on 4896 degrees of freedom
## Multiple R-squared:  6.655e-05,  Adjusted R-squared:  -0.0001377 
## F-statistic: 0.3259 on 1 and 4896 DF,  p-value: 0.5681

Regression Model total.sulfur.dioxide and quality.

RegModel.total.sulfur.dioxide <- lm(total.sulfur.dioxide~quality, data=White_wines)
summary(RegModel.total.sulfur.dioxide)
## 
## Call:
## lm(formula = total.sulfur.dioxide ~ quality, data = White_wines)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -144.107  -28.722   -2.337   28.278  277.508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 187.6464     4.0138   46.75   <2e-16 ***
## quality      -8.3849     0.6752  -12.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.85 on 4896 degrees of freedom
## Multiple R-squared:  0.03053,    Adjusted R-squared:  0.03034 
## F-statistic: 154.2 on 1 and 4896 DF,  p-value: < 2.2e-16

Regression Model density and quality.

RegModel.density <- lm(density~quality, data=White_wines)
summary(RegModel.density)
## 
## Call:
## lm(formula = density ~ quality, data = White_wines)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.007718 -0.002104 -0.000361  0.001859  0.045079 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.000e+00  2.730e-04 3663.07   <2e-16 ***
## quality     -1.037e-03  4.593e-05  -22.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002847 on 4896 degrees of freedom
## Multiple R-squared:  0.09432,    Adjusted R-squared:  0.09414 
## F-statistic: 509.9 on 1 and 4896 DF,  p-value: < 2.2e-16

Regression Model pH and quality.

RegModel.pH <- lm(pH~quality, data=White_wines)
summary(RegModel.pH)
## 
## Call:
## lm(formula = pH ~ quality, data = White_wines)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47034 -0.10034 -0.01034  0.08966  0.61966 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.088623   0.014413 214.301  < 2e-16 ***
## quality     0.016952   0.002425   6.992 3.08e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1503 on 4896 degrees of freedom
## Multiple R-squared:  0.009886,   Adjusted R-squared:  0.009684 
## F-statistic: 48.88 on 1 and 4896 DF,  p-value: 3.081e-12

Regression Model sulphates and quality.

RegModel.sulphates <- lm(sulphates~quality, data=White_wines)
summary(RegModel.sulphates)
## 
## Call:
## lm(formula = sulphates ~ quality, data = White_wines)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27761 -0.08069 -0.01377  0.05931  0.58239 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.449189   0.010931  41.092  < 2e-16 ***
## quality     0.006917   0.001839   3.761 0.000171 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.114 on 4896 degrees of freedom
## Multiple R-squared:  0.002881,   Adjusted R-squared:  0.002678 
## F-statistic: 14.15 on 1 and 4896 DF,  p-value: 0.000171

From this we see that neither citric acid (p = 0.519) nor free sulfur dioxide (p = 0.568) appears to have a significant linear relationship with quality.
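The eleven simple regressions can also be fitted in a loop with quality as the response. The following is a minimal sketch on synthetic data (the `wines` data frame and its coefficients are assumptions made for illustration); it also checks that the slope's t statistic is the same when the roles of the two variables are swapped.

```r
set.seed(2)
n <- 300
# Synthetic stand-in for White_wines; the values are illustrative only
wines <- data.frame(alcohol     = rnorm(n, 10.5, 1.2),
                    citric.acid = rnorm(n, 0.33, 0.12),
                    pH          = rnorm(n, 3.19, 0.15))
wines$quality <- 6 + 0.4 * (wines$alcohol - 10.5) + rnorm(n, sd = 0.8)

predictors <- setdiff(names(wines), "quality")

# Fit quality ~ predictor for each predictor in turn
fits <- lapply(predictors, function(v) {
  lm(reformulate(v, response = "quality"), data = wines)
})
names(fits) <- predictors

# Pull the slope's p-value out of each coefficient table
p_slope <- sapply(fits, function(m) summary(m)$coefficients[2, "Pr(>|t|)"])
print(signif(p_slope, 3))

# The slope test is symmetric: compare quality ~ alcohol with alcohol ~ quality
t_fwd <- summary(fits[["alcohol"]])$coefficients[2, "t value"]
t_rev <- summary(lm(alcohol ~ quality, data = wines))$coefficients[2, "t value"]
print(c(forward = t_fwd, reversed = t_rev))
```

Only the slope estimate and its units change with the direction of the model; the test of whether a linear relationship exists does not.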

Multiple Regressions

Now we will begin building a model using multiple regression. Before building the model, we first split the dataset into a training set and a testing set.

set.seed(20170214) #Random Number seed is the date
White_wines$group <- runif(length(White_wines$quality), min = 0, max = 1) #create a new variable to add to dataset to distribute random numbers from 0-1

White_wines.train <- subset(White_wines, group <= 0.90) #assign 90% of the data to the training set
White_wines.test <- subset(White_wines, group > 0.90) #assign remaining data to the test set

#Did it work?
summary(White_wines.train)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.100  
##  Mean   : 6.851   Mean   :0.2784   Mean   :0.3337   Mean   : 6.342  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.800  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  3.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04574   Mean   : 35.28      Mean   :138.3       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH         sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.72   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.09   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.18   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.19   Mean   :0.4892   Mean   :10.52  
##  3rd Qu.:0.9960   3rd Qu.:3.28   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.82   Max.   :1.0800   Max.   :14.20  
##     quality          group          
##  Min.   :3.000   Min.   :0.0002833  
##  1st Qu.:5.000   1st Qu.:0.2285282  
##  Median :6.000   Median :0.4596618  
##  Mean   :5.879   Mean   :0.4570277  
##  3rd Qu.:6.000   3rd Qu.:0.6859608  
##  Max.   :9.000   Max.   :0.8998507
summary(White_wines.test)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 5.000   Min.   :0.0800   Min.   :0.0000   Min.   : 0.800  
##  1st Qu.: 6.400   1st Qu.:0.2175   1st Qu.:0.2600   1st Qu.: 2.100  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 6.300  
##  Mean   : 6.889   Mean   :0.2766   Mean   :0.3387   Mean   : 6.866  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.:10.400  
##  Max.   :10.200   Max.   :1.0050   Max.   :0.8800   Max.   :22.000  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01400   Min.   :  2.0       Min.   : 24.0       
##  1st Qu.:0.03675   1st Qu.: 23.0       1st Qu.:108.0       
##  Median :0.04300   Median : 35.0       Median :135.0       
##  Mean   :0.04612   Mean   : 35.6       Mean   :139.4       
##  3rd Qu.:0.05000   3rd Qu.: 47.0       3rd Qu.:170.2       
##  Max.   :0.20400   Max.   :124.0       Max.   :260.0       
##     density             pH          sulphates        alcohol     
##  Min.   :0.9877   Min.   :2.770   Min.   :0.280   Min.   : 8.40  
##  1st Qu.:0.9918   1st Qu.:3.080   1st Qu.:0.400   1st Qu.: 9.40  
##  Median :0.9941   Median :3.170   Median :0.480   Median :10.20  
##  Mean   :0.9942   Mean   :3.174   Mean   :0.496   Mean   :10.45  
##  3rd Qu.:0.9964   3rd Qu.:3.260   3rd Qu.:0.560   3rd Qu.:11.30  
##  Max.   :1.0010   Max.   :3.690   Max.   :1.010   Max.   :13.90  
##     quality          group       
##  Min.   :3.000   Min.   :0.9001  
##  1st Qu.:5.000   1st Qu.:0.9229  
##  Median :6.000   Median :0.9528  
##  Mean   :5.872   Mean   :0.9506  
##  3rd Qu.:6.000   3rd Qu.:0.9758  
##  Max.   :8.000   Max.   :0.9993
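Thresholding a uniform random number gives a split that is only approximately 90/10 (here 4,438 of 4,898 rows, about 90.6%, land in the training set). If an exact proportion is wanted, one alternative is to sample row indices directly. This is a sketch of that alternative on a synthetic data frame, not the method used in this analysis.

```r
set.seed(20170214)
n <- 1000
# Synthetic stand-in for White_wines; the values are illustrative only
wines <- data.frame(id = seq_len(n),
                    quality = sample(3:9, n, replace = TRUE))

# Draw exactly 90% of the row indices without replacement
train_idx <- sample(seq_len(n), size = floor(0.9 * n))

wines.train <- wines[train_idx, ]
wines.test  <- wines[-train_idx, ]

print(c(train = nrow(wines.train), test = nrow(wines.test)))
```

Index sampling also avoids adding a helper column such as `group` to the data, so there is no risk of it slipping into a model formula later.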

Now we will begin with a full model including all variables.

LinearModel.Full <- lm(quality ~ alcohol + chlorides + citric.acid + 
  density + fixed.acidity + free.sulfur.dioxide + pH + residual.sugar 
  + sulphates + total.sulfur.dioxide + volatile.acidity, 
  data=White_wines.train)
summary(LinearModel.Full)
## 
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density + 
##     fixed.acidity + free.sulfur.dioxide + pH + residual.sugar + 
##     sulphates + total.sulfur.dioxide + volatile.acidity, data = White_wines.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8642 -0.4973 -0.0362  0.4704  3.0782 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.552e+02  1.937e+01   8.013 1.42e-15 ***
## alcohol               1.885e-01  2.510e-02   7.510 7.10e-14 ***
## chlorides            -2.444e-01  5.701e-01  -0.429 0.668114    
## citric.acid           4.294e-02  1.010e-01   0.425 0.670887    
## density              -1.555e+02  1.965e+01  -7.917 3.06e-15 ***
## fixed.acidity         8.103e-02  2.176e-02   3.724 0.000199 ***
## free.sulfur.dioxide   4.064e-03  8.870e-04   4.581 4.74e-06 ***
## pH                    7.268e-01  1.099e-01   6.614 4.19e-11 ***
## residual.sugar        8.492e-02  7.816e-03  10.865  < 2e-16 ***
## sulphates             6.578e-01  1.068e-01   6.156 8.10e-10 ***
## total.sulfur.dioxide -4.434e-04  3.963e-04  -1.119 0.263311    
## volatile.acidity     -1.822e+00  1.199e-01 -15.199  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7538 on 4426 degrees of freedom
## Multiple R-squared:  0.2805, Adjusted R-squared:  0.2787 
## F-statistic: 156.9 on 11 and 4426 DF,  p-value: < 2.2e-16

The full model explains 28% of the variability in quality (R-squared 0.2805). The F statistic is 156.9 and is highly significant. We will investigate what occurs as this model is reduced.

To continue, we use a backward elimination strategy and remove all variables that were not significant in the full model (chlorides, citric acid, and total sulfur dioxide).

Reduced Model 1 (LinearModel.2) will include alcohol, density, fixed acidity, free sulfur dioxide, pH, residual sugar, sulphates, and volatile acidity.

LinearModel.2 <- lm(quality ~ alcohol +  density + fixed.acidity + 
  free.sulfur.dioxide +  pH + residual.sugar + sulphates +  volatile.acidity, 
  data=White_wines.train)
summary(LinearModel.2)
## 
## Call:
## lm(formula = quality ~ alcohol + density + fixed.acidity + free.sulfur.dioxide + 
##     pH + residual.sugar + sulphates + volatile.acidity, data = White_wines.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8536 -0.4930 -0.0388  0.4675  3.0889 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.598e+02  1.872e+01   8.535  < 2e-16 ***
## alcohol              1.888e-01  2.495e-02   7.566 4.64e-14 ***
## density             -1.603e+02  1.898e+01  -8.445  < 2e-16 ***
## fixed.acidity        8.386e-02  2.133e-02   3.931 8.58e-05 ***
## free.sulfur.dioxide  3.487e-03  7.137e-04   4.885 1.07e-06 ***
## pH                   7.325e-01  1.078e-01   6.792 1.25e-11 ***
## residual.sugar       8.639e-02  7.594e-03  11.377  < 2e-16 ***
## sulphates            6.524e-01  1.064e-01   6.130 9.57e-10 ***
## volatile.acidity    -1.861e+00  1.152e-01 -16.150  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7537 on 4429 degrees of freedom
## Multiple R-squared:  0.2802, Adjusted R-squared:  0.2789 
## F-statistic: 215.5 on 8 and 4429 DF,  p-value: < 2.2e-16

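This reduced model still explains 28% of the variability in quality (adjusted R-squared 0.2789, compared with 0.2787 for the full model). The F statistic increased to 215.5, mainly because nearly the same fit is achieved with fewer predictors, and it remains highly significant. We will check the model diagnostics before reducing it further.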
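The reduction above removed three variables in a single step. A stricter form of backward elimination drops one term at a time and refits, since p-values can change once a correlated predictor leaves the model. The sketch below illustrates that variant with base R's `drop1()` on synthetic data; the data frame, the 0.05 threshold, and the loop are assumptions for illustration rather than the procedure used here.

```r
set.seed(3)
n <- 400
# Synthetic stand-in: only alcohol and volatile.acidity truly affect quality
wines <- data.frame(alcohol          = rnorm(n),
                    volatile.acidity = rnorm(n),
                    citric.acid      = rnorm(n),
                    chlorides        = rnorm(n))
wines$quality <- 6 + 0.5 * wines$alcohol - 0.4 * wines$volatile.acidity + rnorm(n)

model <- lm(quality ~ alcohol + volatile.acidity + citric.acid + chlorides,
            data = wines)
alpha <- 0.05

repeat {
  tests <- drop1(model, test = "F")      # F test for removing each term
  pvals <- tests[["Pr(>F)"]][-1]         # first row is "<none>"
  terms_left <- rownames(tests)[-1]
  if (length(pvals) == 0 || max(pvals) <= alpha) break
  worst <- terms_left[which.max(pvals)]  # least significant remaining term
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

print(formula(model))
```

With a single degree of freedom per term, the F test from `drop1()` is equivalent to the t test shown in `summary()`, so this loop mirrors the p-value rule used in the text.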

View influential observations.

Figure 17: Influential Observations for Model 2

#added variable plots
avPlots(LinearModel.2, id.n=2, id.cex=0.7)

#id.n - number of most extreme observations to label in each plot, so unusual points can be picked out
#id.cex - controls the size of the point labels

Figure 18: Studentized Residuals for model 2

# run the qq-plot
qqPlot(LinearModel.2, id.n=3)

## 4746  254 2782 
##    1    2 4438
# here, id.n identifies the n observations with the largest residuals in absolute value

Figure 19: Residuals for model 2

# diagnostics for Model 2 (8 independent variables)
residualPlots(LinearModel.2)

##                     Test stat Pr(>|t|)
## alcohol                 5.191    0.000
## density                 5.552    0.000
## fixed.acidity          -4.163    0.000
## free.sulfur.dioxide   -10.160    0.000
## pH                      0.880    0.379
## residual.sugar          2.520    0.012
## sulphates               0.729    0.466
## volatile.acidity        3.184    0.001
## Tukey test              2.551    0.011

Outliers

#run Bonferroni test for outliers
outlierTest(LinearModel.2)
##       rstudent unadjusted p-value Bonferonni p
## 4746 -5.285819         1.3116e-07   0.00058211
## 2782  4.931011         8.4800e-07   0.00376340
## 254  -4.496908         7.0712e-06   0.03138200
## 446  -4.485892         7.4449e-06   0.03304100
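The Bonferroni p-values can be reproduced from the studentized residuals using base R. The sketch below takes the values reported for observation 4746 and assumes the usual construction: a two-sided t test with one fewer degree of freedom than the model's residual degrees of freedom, multiplied by the number of observations.

```r
# Values reported above for observation 4746 in LinearModel.2
rstudent_4746 <- -5.285819
df_resid <- 4429   # residual degrees of freedom of LinearModel.2
n_obs <- 4438      # training rows: 4429 residual df + 9 coefficients

# Two-sided p-value from a t distribution with df_resid - 1 degrees of freedom
p_unadj <- 2 * pt(abs(rstudent_4746), df = df_resid - 1, lower.tail = FALSE)

# Bonferroni adjustment: multiply by the number of observations tested, cap at 1
p_bonf <- min(1, n_obs * p_unadj)

print(c(unadjusted = p_unadj, bonferroni = p_bonf))
```

Because every observation is a candidate outlier, the adjustment is severe: only residuals beyond roughly 4.4 in absolute value survive it with a sample of this size.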

Figure 20: Influence Plot for model 2

#make influence plot
influencePlot(LinearModel.2, id.n=3)

##         StudRes         Hat       CookD
## 254  -4.4969082 0.002555952 0.005732828
## 1527 -0.6554449 0.038237083 0.001898028
## 1932 -3.7826585 0.015025299 0.024179469
## 2782  4.9310113 0.351726593 1.458130190
## 4746 -5.2858195 0.058668586 0.192314295
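The three columns of the influence table are also available from base R, which is convenient for filtering observations programmatically. Observation 2782 above stands out with a hat value of 0.35 and a Cook's distance of 1.46. The sketch below plants a comparable point in synthetic data; the 4/n cutoff is a common rule of thumb and an assumption here, not a threshold used in this analysis.

```r
set.seed(6)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Plant one high-leverage point that does not follow the trend
x[n] <- 8
y[n] <- 0

fit <- lm(y ~ x)

# Same three measures that influencePlot() reports
infl <- data.frame(StudRes = rstudent(fit),
                   Hat     = hatvalues(fit),
                   CookD   = cooks.distance(fit))

# Flag observations whose Cook's distance exceeds 4/n
flagged <- infl[infl$CookD > 4 / n, ]
print(flagged[order(-flagged$CookD), ])
```

A point needs both leverage (an unusual predictor value) and a large residual to have a large Cook's distance, which is why 2782 dominates the table while 254 and 4746 have large residuals but much less influence.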

Figure 21: Heteroskedasticity of Model 2

#test for heteroskedasticity
ncvTest(LinearModel.2) #tests for non constant variance. 
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 16.04371    Df = 1     p = 6.189685e-05
vif(LinearModel.2)
##             alcohol             density       fixed.acidity 
##            7.337053           25.278419            2.545179 
## free.sulfur.dioxide                  pH      residual.sugar 
##            1.151922            2.083100           11.620123 
##           sulphates    volatile.acidity 
##            1.126483            1.060061
# a VIF above 4 suggests removing the variable, because it is highly correlated with other predictors in the model

Based on the previous plots and diagnostics, we reduce the model further. Density has by far the largest VIF (25.3), so Model 3 removes it. Residual sugar and free sulfur dioxide are retained for now.
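A VIF can be computed by hand, which makes clear what it measures: regress each predictor on all the others and take 1 / (1 - R-squared). The sketch below does this on synthetic data in which density is built to depend strongly on alcohol; the data and coefficients are assumptions for illustration only.

```r
set.seed(4)
n <- 500
alcohol <- rnorm(n)
# density is constructed to be strongly (negatively) related to alcohol
density <- -0.8 * alcohol + rnorm(n, sd = 0.4)
pH <- rnorm(n)
quality <- 6 + 0.4 * alcohol + rnorm(n)
wines <- data.frame(quality, alcohol, density, pH)

predictors <- c("alcohol", "density", "pH")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = wines))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))
```

In this toy example alcohol and density inflate each other's VIF while pH stays near 1. Removing one variable of a correlated pair usually brings the other's VIF down as well, so it is worth recomputing VIFs after each removal rather than dropping every flagged variable at once.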

Model 3

LinearModel.3 <- lm(quality ~ alcohol + fixed.acidity + 
  free.sulfur.dioxide +  pH + residual.sugar + sulphates +  volatile.acidity, 
  data=White_wines.train)
summary(LinearModel.3)
## 
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity + free.sulfur.dioxide + 
##     pH + residual.sugar + sulphates + volatile.acidity, data = White_wines.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8938 -0.4962 -0.0333  0.4624  3.1774 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.7101754  0.3536637   4.836 1.37e-06 ***
## alcohol              0.3800365  0.0105722  35.947  < 2e-16 ***
## fixed.acidity       -0.0451858  0.0150020  -3.012 0.002610 ** 
## free.sulfur.dioxide  0.0037010  0.0007189   5.148 2.74e-07 ***
## pH                   0.1719520  0.0856708   2.007 0.044797 *  
## residual.sugar       0.0261723  0.0026316   9.946  < 2e-16 ***
## sulphates            0.3989868  0.1029242   3.877 0.000107 ***
## volatile.acidity    -2.0134448  0.1146832 -17.557  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7597 on 4430 degrees of freedom
## Multiple R-squared:  0.2686, Adjusted R-squared:  0.2675 
## F-statistic: 232.5 on 7 and 4430 DF,  p-value: < 2.2e-16

This reduced model still explains 27% of the variability in quality. The F statistic increased to 232.5 and is still highly significant.

Figure 23: Residual Plots for Model 3

# diagnostics for Model 3 (7 independent variables)
residualPlots(LinearModel.3)

##                     Test stat Pr(>|t|)
## alcohol                 5.243    0.000
## fixed.acidity          -3.584    0.000
## free.sulfur.dioxide   -10.370    0.000
## pH                      0.386    0.700
## residual.sugar         -2.049    0.041
## sulphates               0.878    0.380
## volatile.acidity        1.968    0.049
## Tukey test              0.145    0.884

We will investigate what occurs as this model is further reduced by removing free sulfur dioxide.

Model 4

LinearModel.4 <- lm(quality ~ alcohol + fixed.acidity +  pH + residual.sugar + sulphates + volatile.acidity, 
  data=White_wines.train)
summary(LinearModel.4)
## 
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity + pH + residual.sugar + 
##     sulphates + volatile.acidity, data = White_wines.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4043 -0.4962 -0.0369  0.4662  3.1503 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.914453   0.352441   5.432 5.87e-08 ***
## alcohol           0.373244   0.010520  35.481  < 2e-16 ***
## fixed.acidity    -0.051461   0.014995  -3.432 0.000605 ***
## pH                0.178963   0.085906   2.083 0.037286 *  
## residual.sugar    0.029424   0.002562  11.485  < 2e-16 ***
## sulphates         0.432488   0.103013   4.198 2.74e-05 ***
## volatile.acidity -2.080385   0.114271 -18.206  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7618 on 4431 degrees of freedom
## Multiple R-squared:  0.2643, Adjusted R-squared:  0.2633 
## F-statistic: 265.3 on 6 and 4431 DF,  p-value: < 2.2e-16

This reduced model still explains 26% of the variability in quality. The F statistic increased to 265.3 and is still highly significant.

Figure 24: residual Plots Model 4

# diagnostics for the first model with 3 independent variables
residualPlots(LinearModel.4)

##                  Test stat Pr(>|t|)
## alcohol              5.496    0.000
## fixed.acidity       -3.795    0.000
## pH                   0.165    0.869
## residual.sugar      -2.603    0.009
## sulphates            1.054    0.292
## volatile.acidity     1.760    0.078
## Tukey test          -0.203    0.839
vif(LinearModel.4)
##          alcohol    fixed.acidity               pH   residual.sugar 
##         1.276125         1.230980         1.293699         1.294518 
##        sulphates volatile.acidity 
##         1.032790         1.020648
# rule of thumb: a VIF above 4 suggests the predictor is highly correlated with the other predictors and is a candidate for removal; all values here are close to 1
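
The variance inflation factor reported by vif() can be reproduced by hand: regress each predictor on the remaining predictors and take 1 / (1 - R squared). The sketch below does this in base R on simulated data. The variables x1, x2, x3 and the helper manual_vif() are hypothetical and are not part of the wine analysis; x3 is deliberately built from x1 so that the two show inflated values.

```r
# Simulated predictors (illustrative only, not the wine data)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)                   # independent of the others
x3 <- x1 + rnorm(n, sd = 0.3)    # deliberately collinear with x1
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- manual_vif(X)
print(round(vifs, 2))
```

In this simulation x2 should come out near 1, while x1 and x3 land well above the cutoff of 4. The wine predictors in Model 4 all behave like x2.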

Model 5

LinearModel.5 <- lm(quality ~ alcohol + fixed.acidity + residual.sugar + sulphates + volatile.acidity, 
  data=White_wines.train)
summary(LinearModel.5)
## 
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity + residual.sugar + 
##     sulphates + volatile.acidity, data = White_wines.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3580 -0.4939 -0.0352  0.4642  3.1857 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.560289   0.167709   15.27  < 2e-16 ***
## alcohol           0.373620   0.010522   35.51  < 2e-16 ***
## fixed.acidity    -0.064489   0.013634   -4.73 2.31e-06 ***
## residual.sugar    0.028655   0.002536   11.30  < 2e-16 ***
## sulphates         0.468359   0.101603    4.61 4.15e-06 ***
## volatile.acidity -2.088819   0.114243  -18.28  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7621 on 4432 degrees of freedom
## Multiple R-squared:  0.2635, Adjusted R-squared:  0.2627 
## F-statistic: 317.2 on 5 and 4432 DF,  p-value: < 2.2e-16

This reduced model still explains 26% of the variability in quality. The F statistic increased to 317.2 and is still highly significant. The adjusted R squared is essentially unchanged (0.2633 for Model 4 versus 0.2627 for Model 5), which indicates that removing pH costs very little fit.

Figure 25: Residual Plots Model 5

# residual diagnostics for Model 5 (5 predictors)
residualPlots(LinearModel.5)

##                  Test stat Pr(>|t|)
## alcohol              5.143    0.000
## fixed.acidity       -3.568    0.000
## residual.sugar      -2.491    0.013
## sulphates            0.929    0.353
## volatile.acidity     1.930    0.054
## Tukey test          -0.471    0.637
vif(LinearModel.5)
##          alcohol    fixed.acidity   residual.sugar        sulphates 
##         1.275751         1.016852         1.267666         1.003936 
## volatile.acidity 
##         1.019367
# rule of thumb: a VIF above 4 suggests the predictor is highly correlated with the other predictors and is a candidate for removal; all values here are close to 1

Based on the residuals, I would like to see what happens when residual sugar and fixed acidity are removed from the model.

Model 6

LinearModel.6 <- lm(quality ~ alcohol + sulphates + volatile.acidity, 
  data=White_wines.train)
summary(LinearModel.6)
## 
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity, 
##     data = White_wines.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3158 -0.4886 -0.0468  0.4947  3.1571 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.785383   0.116676  23.873  < 2e-16 ***
## alcohol           0.325036   0.009493  34.239  < 2e-16 ***
## sulphates         0.434956   0.103148   4.217 2.53e-05 ***
## volatile.acidity -1.936946   0.115352 -16.792  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7744 on 4434 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2388 
## F-statistic:   465 on 3 and 4434 DF,  p-value: < 2.2e-16

This model does not appear to fit better than the previous model. It accounts for only 24% of the variability and, while still significant, its adjusted R squared has decreased from 0.26 to 0.24.
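
The adjusted R squared values used in this comparison can be checked by hand from the multiple R squared, the number of training rows, and the number of predictors. The sketch below is only a check on the printed summaries: adj_r2() is a hypothetical helper, and n_train = 4438 is inferred from the residual degrees of freedom plus the number of coefficients (for example 4432 + 6 for Model 5).

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
# r2: multiple R squared, n: number of rows, p: number of predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_train <- 4438  # inferred: residual df + number of coefficients

# Multiple R squared values copied from the summary() output above
fits <- data.frame(
  model = c("Model 4", "Model 5", "Model 6"),
  r2    = c(0.2643, 0.2635, 0.2393),
  p     = c(6, 5, 3)
)
fits$adj_r2 <- adj_r2(fits$r2, n_train, fits$p)
print(fits)
```

Because the penalty term (n - 1) / (n - p - 1) is very close to 1 with over 4,000 rows, the adjusted values track the raw R squared closely, so the drop for Model 6 reflects a real loss of fit and not a change in the penalty.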

compareCoefs(LinearModel.2, LinearModel.3, LinearModel.4, LinearModel.5, 
  LinearModel.Full)
## 
## Call:
## 1: lm(formula = quality ~ alcohol + density + fixed.acidity + 
##   free.sulfur.dioxide + pH + residual.sugar + sulphates + 
##   volatile.acidity, data = White_wines.train)
## 2: lm(formula = quality ~ alcohol + fixed.acidity + 
##   free.sulfur.dioxide + pH + residual.sugar + sulphates + 
##   volatile.acidity, data = White_wines.train)
## 3: lm(formula = quality ~ alcohol + fixed.acidity + pH + 
##   residual.sugar + sulphates + volatile.acidity, data = 
##   White_wines.train)
## 4: lm(formula = quality ~ alcohol + fixed.acidity + residual.sugar + 
##   sulphates + volatile.acidity, data = White_wines.train)
## 5: lm(formula = quality ~ alcohol + chlorides + citric.acid + density 
##   + fixed.acidity + free.sulfur.dioxide + pH + residual.sugar + 
##   sulphates + total.sulfur.dioxide + volatile.acidity, data = 
##   White_wines.train)
##                         Est. 1      SE 1    Est. 2      SE 2    Est. 3
## (Intercept)           1.60e+02  1.87e+01  1.71e+00  3.54e-01  1.91e+00
## alcohol               1.89e-01  2.50e-02  3.80e-01  1.06e-02  3.73e-01
## density              -1.60e+02  1.90e+01                              
## fixed.acidity         8.39e-02  2.13e-02 -4.52e-02  1.50e-02 -5.15e-02
## free.sulfur.dioxide   3.49e-03  7.14e-04  3.70e-03  7.19e-04          
## pH                    7.32e-01  1.08e-01  1.72e-01  8.57e-02  1.79e-01
## residual.sugar        8.64e-02  7.59e-03  2.62e-02  2.63e-03  2.94e-02
## sulphates             6.52e-01  1.06e-01  3.99e-01  1.03e-01  4.32e-01
## volatile.acidity     -1.86e+00  1.15e-01 -2.01e+00  1.15e-01 -2.08e+00
## chlorides                                                             
## citric.acid                                                           
## total.sulfur.dioxide                                                  
##                           SE 3    Est. 4      SE 4    Est. 5      SE 5
## (Intercept)           3.52e-01  2.56e+00  1.68e-01  1.55e+02  1.94e+01
## alcohol               1.05e-02  3.74e-01  1.05e-02  1.88e-01  2.51e-02
## density                                            -1.56e+02  1.96e+01
## fixed.acidity         1.50e-02 -6.45e-02  1.36e-02  8.10e-02  2.18e-02
## free.sulfur.dioxide                                 4.06e-03  8.87e-04
## pH                    8.59e-02                      7.27e-01  1.10e-01
## residual.sugar        2.56e-03  2.87e-02  2.54e-03  8.49e-02  7.82e-03
## sulphates             1.03e-01  4.68e-01  1.02e-01  6.58e-01  1.07e-01
## volatile.acidity      1.14e-01 -2.09e+00  1.14e-01 -1.82e+00  1.20e-01
## chlorides                                          -2.44e-01  5.70e-01
## citric.acid                                         4.29e-02  1.01e-01
## total.sulfur.dioxide                               -4.43e-04  3.96e-04
# compare the results of the three regression models (Models 4, 5, and 6)
stargazer(LinearModel.4,LinearModel.5, LinearModel.6,title="Comparison of Regression outputs",type="text",align=TRUE)

Comparison of Regression outputs
===================================================================================================
                    Dependent variable:
                    -------------------------------------------------------------------------------
                    quality
                    (1)                        (2)                        (3)
---------------------------------------------------------------------------------------------------
alcohol             0.373***                   0.374***                   0.325***
                    (0.011)                    (0.011)                    (0.009)

fixed.acidity       -0.051***                  -0.064***
                    (0.015)                    (0.014)

pH                  0.179**
                    (0.086)

residual.sugar      0.029***                   0.029***
                    (0.003)                    (0.003)

sulphates           0.432***                   0.468***                   0.435***
                    (0.103)                    (0.102)                    (0.103)

volatile.acidity    -2.080***                  -2.089***                  -1.937***
                    (0.114)                    (0.114)                    (0.115)

Constant            1.914***                   2.560***                   2.785***
                    (0.352)                    (0.168)                    (0.117)

---------------------------------------------------------------------------------------------------
Observations        4,438                      4,438                      4,438
R2                  0.264                      0.264                      0.239
Adjusted R2         0.263                      0.263                      0.239
Residual Std. Error 0.762 (df = 4431)          0.762 (df = 4432)          0.774 (df = 4434)
F Statistic         265.260*** (df = 6; 4431)  317.205*** (df = 5; 4432)  465.047*** (df = 3; 4434)
===================================================================================================
Note:                                                                   *p<0.1; **p<0.05; ***p<0.01

# with type="html" the table is only visible after knitting to HTML; with type="text" it prints directly. Options for type are "html", "text", or "latex"
# test for heteroskedasticity
ncvTest(LinearModel.5) # score test for non-constant variance; the null hypothesis is constant variance, so a small p-value indicates heteroskedasticity
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 25.17967    Df = 1     p = 5.222981e-07
vif(LinearModel.5)
##          alcohol    fixed.acidity   residual.sugar        sulphates 
##         1.275751         1.016852         1.267666         1.003936 
## volatile.acidity 
##         1.019367
# rule of thumb: a VIF above 4 suggests the predictor is highly correlated with the other predictors and is a candidate for removal; all values here are close to 1

Figure 26: Influence Plot Model 5

#make influence plot
influencePlot(LinearModel.5, id.n=3)

##         StudRes          Hat       CookD
## 254  -4.4170724 0.0008409785 0.002725574
## 446  -4.2524949 0.0009687925 0.002911503
## 741  -4.3592646 0.0013393889 0.004230614
## 1418 -3.5644279 0.0041885139 0.008883125
## 1527  0.6459469 0.0181910902 0.001288639
## 2051 -3.3219899 0.0081047808 0.014994727
## 2782 -0.8366248 0.0497611466 0.006109381
## 4040 -1.1217840 0.0154427311 0.003289463
## 4481 -3.0490524 0.0064404365 0.010025076

Based on this data I believe LinearModel.5 to be the best model of this data. Currently the model accounts for 26% of the variability in the quality score. I am not pleased with the plots of the residuals or the influential points, and I would also like to include fewer variables in the model. However, I am unsure how far it is acceptable to trade these flaws against the amount of variability the model accounts for.

Testing the Model

We will now fit the Model 5 specification to the testing dataset.

LinearModel.test <- lm(quality ~ alcohol + fixed.acidity + residual.sugar + sulphates + volatile.acidity, 
  data=White_wines.test)
summary(LinearModel.test)
## 
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity + residual.sugar + 
##     sulphates + volatile.acidity, data = White_wines.test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71904 -0.47984 -0.03971  0.46703  2.62025 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.837360   0.501036   7.659 1.14e-13 ***
## alcohol           0.340385   0.031312  10.871  < 2e-16 ***
## fixed.acidity    -0.165806   0.041642  -3.982 7.97e-05 ***
## residual.sugar    0.015319   0.007811   1.961   0.0505 .  
## sulphates         0.248848   0.269160   0.925   0.3557    
## volatile.acidity -2.203158   0.346791  -6.353 5.14e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7234 on 454 degrees of freedom
## Multiple R-squared:  0.3123, Adjusted R-squared:  0.3047 
## F-statistic: 41.23 on 5 and 454 DF,  p-value: < 2.2e-16
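
One way to judge whether the test-set coefficients agree with the training fit is to build 95% confidence intervals from the test-set estimates and standard errors, and check whether the Model 5 training estimates fall inside them. The sketch below re-enters the printed values by hand; the object names are illustrative, and this is an informal check, not a formal test of coefficient equality.

```r
# Estimates and standard errors from summary(LinearModel.test) above
est_test <- c(alcohol = 0.340385, fixed.acidity = -0.165806,
              residual.sugar = 0.015319, sulphates = 0.248848,
              volatile.acidity = -2.203158)
se_test  <- c(0.031312, 0.041642, 0.007811, 0.269160, 0.346791)

# Estimates from summary(LinearModel.5), fit on the training data
est_train <- c(0.373620, -0.064489, 0.028655, 0.468359, -2.088819)

# 95% interval: estimate +/- t critical value * standard error (454 residual df)
t_crit <- qt(0.975, df = 454)
ci <- data.frame(
  lower = est_test - t_crit * se_test,
  upper = est_test + t_crit * se_test,
  train = est_train
)
ci$train_inside <- ci$train >= ci$lower & ci$train <= ci$upper
print(round(ci[, c("lower", "upper", "train")], 3))
print(ci$train_inside)
```

The sulphates interval spans zero, which matches its non-significant p-value on the test data, and the fixed acidity interval is the only one that excludes the training estimate.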

Fitting this model to the test data gives a significant F statistic, with the model explaining 31% of the variability in the quality score. The model uses alcohol, fixed acidity, residual sugar, sulphates, and volatile acidity to explain quality. On the test data, sulphates is not significant (p = 0.36) and residual sugar is borderline (p = 0.0505). The equation for this model is:

Y = 3.84 + (0.34)x1 + (-0.17)x2 + (0.02)x3 + (0.25)x4 + (-2.20)x5 + E

Where:
Y = quality
x1 = alcohol
x2 = fixed acidity
x3 = residual sugar
x4 = sulphates
x5 = volatile acidity
E = error

Using this model, volatile acidity appears to influence quality the most. Holding the other variables constant, a one-unit increase in volatile acidity is associated with a 2.20 point decrease in the quality score of the wine.
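
The chunk above estimates new coefficients from the test rows. A complementary check, sketched here, keeps the coefficients estimated on the training rows fixed and scores them on held-out rows with predict(). The data below are simulated so the block runs on its own; the data frame sim, the two-predictor formula, and the 100-row holdout are all assumptions for illustration, and with the real data the same calls would use LinearModel.5 and White_wines.test.

```r
# Simulated stand-in for the wine data (illustrative only)
set.seed(42)
n <- 1000
sim <- data.frame(
  alcohol          = rnorm(n, mean = 10.5, sd = 1.2),
  volatile.acidity = rnorm(n, mean = 0.28, sd = 0.1)
)
sim$quality <- 3 + 0.35 * sim$alcohol - 2 * sim$volatile.acidity +
  rnorm(n, sd = 0.75)

# Hold out 100 rows as a test set
test_idx <- sample(n, size = 100)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Fit on the training rows only
fit <- lm(quality ~ alcohol + volatile.acidity, data = train)

# Score the held-out rows with the training coefficients
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$quality - pred)^2))

# Out-of-sample R squared, measured against the training mean as the baseline
r2_out <- 1 - sum((test$quality - pred)^2) /
  sum((test$quality - mean(train$quality))^2)

cat("Test RMSE:", round(rmse, 3), "\n")
cat("Out-of-sample R squared:", round(r2_out, 3), "\n")
```

Scoring this way measures how well the training model generalizes, whereas refitting on the test rows measures how well the same set of predictors can describe a different sample.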

Conclusion

Fitting the model to the full data set uses alcohol, fixed acidity, residual sugar, sulphates, and volatile acidity to explain 27% of the variability in the quality score of the wine. The usefulness of the model can be investigated using the diagnostic tests and plots below.

LinearModel.testfull <- lm(quality ~ alcohol + fixed.acidity + residual.sugar + sulphates + volatile.acidity, 
  data=White_wines)
summary(LinearModel.testfull)
## 
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity + residual.sugar + 
##     sulphates + volatile.acidity, data = White_wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3693 -0.4939 -0.0341  0.4634  3.2090 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.671921   0.158935  16.811  < 2e-16 ***
## alcohol           0.370957   0.009972  37.199  < 2e-16 ***
## fixed.acidity    -0.073315   0.012958  -5.658 1.62e-08 ***
## residual.sugar    0.027559   0.002412  11.427  < 2e-16 ***
## sulphates         0.443294   0.095155   4.659 3.27e-06 ***
## volatile.acidity -2.102798   0.108510 -19.379  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7588 on 4892 degrees of freedom
## Multiple R-squared:  0.2667, Adjusted R-squared:  0.266 
## F-statistic: 355.9 on 5 and 4892 DF,  p-value: < 2.2e-16

Figure 27: Residual Plots Model 5 on full dataset

# residual diagnostics for Model 5 fit on the full dataset
residualPlots(LinearModel.testfull)

##                  Test stat Pr(>|t|)
## alcohol              5.264    0.000
## fixed.acidity       -3.765    0.000
## residual.sugar      -2.360    0.018
## sulphates            0.432    0.666
## volatile.acidity     2.534    0.011
## Tukey test          -0.652    0.514

Figure 28: Residual Plots Model 5 on full dataset

# residual diagnostics for Model 5 fit on the full dataset (same call as the previous figure)
residualPlots(LinearModel.testfull)

##                  Test stat Pr(>|t|)
## alcohol              5.264    0.000
## fixed.acidity       -3.765    0.000
## residual.sugar      -2.360    0.018
## sulphates            0.432    0.666
## volatile.acidity     2.534    0.011
## Tukey test          -0.652    0.514

Figure 29: Added Variable Plots Model 5 on full dataset

#added variable plots
avPlots(LinearModel.testfull, id.n=2, id.cex=0.7)

# id.n - number of the most extreme observations to label in each plot, which helps pick out outlying wines
# id.cex - controls the size of the point labels

Figure 30: Studentized Residuals Model 5 on full dataset

# run the qq-plot
qqPlot(LinearModel.testfull, id.n=3)

## 254 741 446 
##   1   2   3
# here, id.n identifies the n observations with the largest residuals in absolute value

Outliers Model 5 on full dataset

#run Bonferroni test for outliers
outlierTest(LinearModel.testfull)
##      rstudent unadjusted p-value Bonferonni p
## 254 -4.450691         8.7497e-06     0.042856
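
The Bonferroni p-value in this output is the unadjusted p-value multiplied by the number of observations tested, capped at 1. The short check below reproduces it from the printed numbers, taking n = 4898 from the residual degrees of freedom plus the six coefficients; the variable names are illustrative.

```r
# Unadjusted two-sided p-value for observation 254, from outlierTest() above
p_unadjusted <- 8.7497e-06

# Number of observations in the full dataset (4892 residual df + 6 coefficients)
n_obs <- 4898

# Bonferroni correction: multiply by the number of tests, cap at 1
p_bonferroni <- min(1, p_unadjusted * n_obs)
print(p_bonferroni)

# Still below 0.05 after correction, so observation 254 is flagged as an outlier
print(p_bonferroni < 0.05)
```

The corrected value sits only just under 0.05, so the evidence that observation 254 is an outlier is much weaker than the raw p-value suggests.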

Figure 31: Influential Points Model 5 on full dataset

#identify highly influential points
influenceIndexPlot(LinearModel.testfull, id.n=3)

Figure 32: Influence Plot Model 5 on full dataset

#make influence plot
influencePlot(LinearModel.testfull, id.n=3)

##         StudRes          Hat        CookD
## 254  -4.4506905 0.0007658563 2.520676e-03
## 446  -4.2595629 0.0008827298 2.662385e-03
## 741  -4.3741368 0.0012254319 3.898059e-03
## 1418 -3.5339258 0.0037893386 7.898727e-03
## 1527  0.7303460 0.0165631880 1.497425e-03
## 1952  0.1694621 0.0147172029 7.150637e-05
## 2051 -3.2721210 0.0073798698 1.324074e-02
## 2782 -0.7153193 0.0454303839 4.059110e-03
## 4481 -3.0466504 0.0058563248 9.097779e-03

Heteroskedasticity Model 5 on full dataset

#test for heteroskedasticity
ncvTest(LinearModel.testfull) # score test for non-constant variance; the null hypothesis is constant variance, so a small p-value indicates heteroskedasticity
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 26.55193    Df = 1     p = 2.565479e-07
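
The p-values printed by ncvTest() are upper-tail chi-square probabilities with 1 degree of freedom, so they can be recomputed directly from the test statistics. The sketch below does this for both Model 5 fits (training data and full dataset); the vector names are illustrative.

```r
# Chi-square statistics copied from the two ncvTest() outputs
chisq <- c(model5_train = 25.17967, model5_full = 26.55193)

# Upper-tail probability under the null hypothesis of constant variance
p_values <- pchisq(chisq, df = 1, lower.tail = FALSE)
print(p_values)

# TRUE means the constant-variance null is rejected at the 5% level
reject_constant_variance <- p_values < 0.05
print(reject_constant_variance)
```

Both p-values are far below 0.05, so the test rejects constant variance in both fits: the residuals are heteroskedastic, and the standard errors in the summary tables should be read with that in mind.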

Test for Multicollinearity Model 5 on full dataset

vif(LinearModel.testfull)
##          alcohol    fixed.acidity   residual.sugar        sulphates 
##         1.280977         1.017086         1.272816         1.003092 
## volatile.acidity 
##         1.017465
# rule of thumb: a VIF above 4 suggests the predictor is highly correlated with the other predictors and is a candidate for removal; all values here are close to 1

Fitting this model to the full dataset gives a significant F statistic, with the model explaining 27% of the variability in the quality score. The model uses alcohol, fixed acidity, residual sugar, sulphates, and volatile acidity to explain quality. The equation for this model is:

Y = 2.67 + (0.37)x1 + (-0.07)x2 + (0.03)x3 + (0.44)x4 + (-2.10)x5 + E

Where:
Y = quality
x1 = alcohol
x2 = fixed acidity
x3 = residual sugar
x4 = sulphates
x5 = volatile acidity
E = error
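
The fitted equation can be evaluated directly from the full-precision coefficients in the summary(LinearModel.testfull) output. The sketch below predicts the quality score for a hypothetical wine whose measurements equal the dataset medians from Table 1, then shows the effect of raising volatile acidity by 0.1 with everything else held fixed. The wine itself and the object names are illustrative.

```r
# Coefficients copied from summary(LinearModel.testfull)
coefs <- c(intercept = 2.671921, alcohol = 0.370957,
           fixed.acidity = -0.073315, residual.sugar = 0.027559,
           sulphates = 0.443294, volatile.acidity = -2.102798)

# Hypothetical wine at the median of each predictor (Table 1);
# the leading 1 multiplies the intercept
median_wine <- c(intercept = 1, alcohol = 10.40, fixed.acidity = 6.8,
                 residual.sugar = 5.2, sulphates = 0.47,
                 volatile.acidity = 0.26)

# Predicted quality = sum of coefficient * value
pred_median <- sum(coefs * median_wine)
print(round(pred_median, 2))

# Same wine with volatile acidity raised by 0.1
worse_wine <- median_wine
worse_wine["volatile.acidity"] <- worse_wine["volatile.acidity"] + 0.1
pred_worse <- sum(coefs * worse_wine)
print(round(pred_worse - pred_median, 2))
```

The predicted score for the median wine is about 5.84, close to the mean quality of 5.878 in Table 1, and the 0.1 increase in volatile acidity lowers it by about 0.21.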

Using this model, volatile acidity still appears to have the largest influence on wine quality. Holding the other variables constant, a one-unit increase in volatile acidity is associated with a 2.10 point decrease in the quality score of the wine. Because volatile acidity only ranges from 0.08 to 1.10 in this data, a 0.1 unit increase, which corresponds to a decrease of about 0.21 points, is a more realistic scale. This agrees with how wine is tested for spoilage, as volatile acidity is traditionally used as a measure of wine spoilage. Volatile acidity measures a variety of byproducts that have accumulated in wine, including acetic, lactic, formic, butyric, and propionic acids. There are legal limits on the volatile acidity allowed in a batch of wine. Levels higher than the legal amount indicate the wine has over-fermented (Neeley, 2004).

Louis Pasteur sought to discover the cause of alcohol spoiling in 1857 and, as a result, identified acetic acid producing bacteria as the culprit. Because these bacteria are aerobic, the process occurs faster in the presence of oxygen, which is why many tools exist to vacuum the oxygen out of a bottle of wine. The findings within this dataset appear to agree with the findings of Louis Pasteur.

References

Neeley, E. (2004). Volatile Acidity. Waterhouse Lab: UC Davis. Retrieved from http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity

The rmarkdown used to create this file can be found at https://github.com/amonda/Regression-1. The file is named New.Rmd.